TASK 1: TITANIC SURVIVAL PREDICTION

Task Description:¶

  • Use the Titanic dataset to build a model that predicts whether a passenger on the Titanic survived or not. This is a classic beginner project with readily available data.
  • The dataset typically used for this project contains information about individual passengers, such as their age, gender, ticket class, fare, cabin, and whether or not they survived.

About the Dataset:¶

The Titanic dataset is curated from records of the passengers aboard the Titanic (their age, class, gender, and so on) to predict whether they survived. It contains both numerical and string values, organized in the following 12 predefined columns:

  • PassengerId - Unique identifier for each passenger
  • Survived - Whether the passenger survived (1) or not (0)
  • Pclass - Ticket class the passenger travelled in (1, 2, or 3)
  • Name - Passenger name
  • Sex - Gender of the passenger
  • Age - Age of the passenger
  • SibSp - Number of siblings or spouses aboard
  • Parch - Number of parents or children aboard
  • Ticket - Ticket number
  • Fare - Amount paid for the ticket
  • Cabin - Cabin of residence
  • Embarked - Port of embarkation (S, C, or Q)
In [8]:
# Importing all the necessary libraries

# Data Manipulation
import pandas as pd
import numpy as np
from scipy.stats import f_oneway, chi2_contingency

# Data Viz
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import seaborn as sns


# ML Models
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import precision_score
from mlxtend.classifier import StackingClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score


# filter warnings
import warnings 
warnings.filterwarnings('ignore')

Exploratory Data Analysis (EDA)¶

In [9]:
# load the data from csv file to Pandas DataFrame

titanic_data = pd.read_csv(r'Titanic-Dataset.csv')
In [10]:
titanic_data.head()
Out[10]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [11]:
titanic_data.tail()
Out[11]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.00 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q
In [12]:
# Function to check number of rows and columns of dataset, number of missing values in each column,
# glimpse of the dataframe, statistical and important information about the dataset 

def analysis(data):
    print(f'Titanic Data Size  : {data.size}')
    print(f'\nShape of the dataframe: {data.shape[0]} rows and {data.shape[1]} columns')
    print("*" * 100)
    print(f'\nMissing values in each column: \n{data.isnull().sum()} ')
    print(f'\nTotal missing values in the dataframe: {data.isnull().sum().sum()} ')
    print("*" * 100)
    print("\nGlimpse of the dataframe:")
    display(data.head())
    print("*" * 100)
    print("\nStatistical measures about the data:")
    display(data.describe())
    print("*" * 100)
    print("\nSome important information about the dataframe:\n")
    display(data.info())
    print("*" * 110)
    
data = titanic_data
analysis(titanic_data)
Titanic Data Size  : 10692

Shape of the dataframe: 891 rows and 12 columns
****************************************************************************************************

Missing values in each column: 
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64 

Total missing values in the dataframe: 866 
****************************************************************************************************

Glimpse of the dataframe:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
****************************************************************************************************

Statistical measures about the data:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
****************************************************************************************************

Some important information about the dataframe:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
**************************************************************************************************************

Finding the Missing values¶

In [13]:
# Checking missing values and their percentage in the dataframe
def missing(df):
    missing_number = df.isnull().sum().sort_values(ascending=False)
    missing_percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
    missing_values = pd.concat([missing_number, missing_percent], axis=1, keys=['Missing_Number', 'Missing_Percent'])
    return missing_values

missing(titanic_data)
Out[13]:
Missing_Number Missing_Percent
Cabin 687 0.771044
Age 177 0.198653
Embarked 2 0.002245
PassengerId 0 0.000000
Survived 0 0.000000
Pclass 0 0.000000
Name 0 0.000000
Sex 0 0.000000
SibSp 0 0.000000
Parch 0 0.000000
Ticket 0 0.000000
Fare 0 0.000000
In [14]:
#Visualization of missing values
sns.heatmap(titanic_data.isnull());
  • The Cabin column will be dropped from the dataframe due to its high proportion of missing values.
  • Null values in the Age column will be imputed with the median.
  • Null values in the Fare column would be imputed with the mean (this file has none, but the step guards against unseen data).

Handling the Missing values¶

In [15]:
# drop the "Cabin" column from the dataframe
titanic_data = titanic_data.drop(columns='Cabin', axis=1)
In [16]:
# replacing the missing values in "Age" column with the median value
titanic_data['Age'].fillna(titanic_data['Age'].median(), inplace=True)
In [17]:
# finding the mean value of "Fare" column
print(titanic_data['Fare'].mean())
32.204207968574636
In [18]:
# replacing the missing values in "Fare" column with the mean value
titanic_data['Fare'].fillna(titanic_data['Fare'].mean(), inplace=True)
In [20]:
# check the number of missing values in each column
titanic_data.isnull().sum()
Out[20]:
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       2
dtype: int64
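The two remaining nulls in Embarked are left in place here; `pd.get_dummies` later encodes them as all-False rows. A hedged alternative is to fill them with the most frequent port before encoding. A minimal sketch, using a small illustrative frame in place of `titanic_data`:

```python
import pandas as pd

# Small illustrative frame standing in for titanic_data
frame = pd.DataFrame({'Embarked': ['S', 'C', 'S', None, 'Q', None, 'S']})

# Impute the missing ports with the most frequent value (the mode)
mode_port = frame['Embarked'].mode()[0]
frame['Embarked'] = frame['Embarked'].fillna(mode_port)

print(mode_port)                         # 'S'
print(frame['Embarked'].isnull().sum())  # 0
```

On the real data, `titanic_data['Embarked'].fillna(titanic_data['Embarked'].mode()[0], inplace=True)` would fill both nulls with 'S', the dominant port.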
In [21]:
#Visualization of missing values
sns.heatmap(titanic_data.isnull());

Data Visualization¶

In [23]:
# Selecting only required columns
df = titanic_data[['Survived','Pclass','Sex','Age','SibSp', 'Parch','Fare','Embarked']]
In [24]:
#head of the dataframe
df.head()
Out[24]:
Survived Pclass Sex Age SibSp Parch Fare Embarked
0 0 3 male 22.0 1 0 7.2500 S
1 1 1 female 38.0 1 0 71.2833 C
2 1 3 female 26.0 0 0 7.9250 S
3 1 1 female 35.0 1 0 53.1000 S
4 0 3 male 35.0 0 0 8.0500 S
In [25]:
# Distribution of Survival Bar plot 
# Assuming `df` contains the data and 'Survived' is the column of interest
fig = go.Figure()

# Add a bar trace
fig.add_trace(go.Bar(
    x=df['Survived'].value_counts().index,
    y=df['Survived'].value_counts(),
    text=df['Survived'].value_counts(),
    textposition='auto',
    hovertemplate='<b>%{x}</b><br>Count: %{y}<br>',
    marker=dict(color=['red', 'green'])  # Customize bar colors
))

# Update x-axis and layout
fig.update_xaxes(
    type='category', 
    tickvals=[0, 1], 
    ticktext=['<b>Not Survived</b>', '<b>Survived</b>'], 
    tickfont_size=14, 
    color='black'
)

fig.update_layout(
    title_text='Distribution of Survival',
    xaxis_title='Survival Status',
    yaxis_title='Count',
    title_font_size=20,
    title_x=0.5,
    template='plotly_white'
)

fig.show()

From the above visualization, we can see that, out of 891 passengers, 342 survived the Titanic crash and 549 did not.

Note: GitHub does not render interactive graphs. For full interactivity, run the code locally or load this page with nbviewer.org.

In [26]:
# Distribution for survival distribution by Sex
fig = go.Figure()

fig.add_trace(go.Bar(
    x=df.groupby(['Sex', 'Survived']).size().unstack().index,
    y=df.groupby(['Sex', 'Survived']).size().unstack()[1],  
    text=df.groupby(['Sex', 'Survived']).size().unstack()[1],
    textposition='outside',
    textfont=dict(size=14,color="black"),
    hovertemplate='<b>%{x}</b><br>Count: %{y}<br>Survived',
    name='Survived',
    marker_color='hotpink', marker_line=dict(width=1, color='black'),
))

fig.add_trace(go.Bar(
    x=df.groupby(['Sex', 'Survived']).size().unstack().index,
    y=df.groupby(['Sex', 'Survived']).size().unstack()[0],  
    text=df.groupby(['Sex', 'Survived']).size().unstack()[0],
    textposition='outside',
    textfont=dict(size=14,color="black"),
    hovertemplate='<b>%{x}</b><br>Count: %{y}<br>Not Survived',
    name='Not Survived',
    marker_color='dodgerblue', marker_line=dict(width=1, color='black'),
))

fig.update_xaxes(type='category', tickvals=[0, 1], ticktext=['<b>Female</b>', '<b>Male</b>'], tickfont_size=14, color ='black')
fig.update_yaxes(tickfont_size=14)

fig.update_layout(
    title='<b>Distribution of Survival by Sex</b>',title_font_family="Times New Roman",title_font=dict(size=50),title_font_color="black",
    xaxis=dict(title='<b>Survival</b>', title_font=dict(size=18)),
    yaxis=dict(title='<b>Count</b>', title_font=dict(size=18),color='black'),
    title_x=0.5,
    barmode='group',  
  
    legend=dict(title='<b>Category</b>', x=0.01, y=1, title_font_family="Times New Roman", font=dict(family="Courier",
            size=13, color="black"), bgcolor="LightSteelBlue", bordercolor="Black", borderwidth=1)
)

fig.show()

From the above visualization, we can see that survival was strongly skewed by sex: most female passengers survived, while most male passengers did not.

In [27]:
# Distribution for Survival by Passenger Class
fig = go.Figure()

class_colors = {1: '#fdca26', 2: '#fb9f3a', 3: '#ed7953'}

for pclass in [1, 2, 3]:
    filtered_df = df[df['Pclass'] == pclass]
    
    fig.add_trace(go.Bar(
        x=filtered_df['Survived'].value_counts().index.map({0: 'Not Survived', 1: 'Survived'}),
        y=filtered_df['Survived'].value_counts(),
        text=filtered_df['Survived'].value_counts(),
        textposition='outside',
        textfont=dict(size=14,color="black"),
        hovertemplate='<b>%{x}</b><br>Count: %{y}<br>',
        marker_color=class_colors[pclass], marker_line=dict(width=1, color='black'),
        name=f'Class {pclass}'
    ))

fig.update_xaxes(type='category', tickfont_size=14, color ='black')
fig.update_yaxes(tickfont_size=14)

fig.update_layout(
    title='<b>Distribution for Survival by Passenger Class</b>',title_font_family="Times New Roman",title_font=dict(size=30),title_font_color="black",
    xaxis=dict(title='<b>Survival</b>', title_font=dict(size=18)),
    yaxis=dict(title='<b>Count</b>', title_font=dict(size=18),color='black'),
    title_x=0.5,
    barmode='group',
    legend=dict(title='<b>Passenger Class</b>', x=0.9, y=1, title_font_family="Times New Roman", font=dict(family="Courier",
            size=13, color="black"), bgcolor="lightyellow", bordercolor="Black", borderwidth=2),
)

fig.show()

From the above visualization, we can see that the majority of passengers in Pclass=1 survived, while the majority of passengers in Pclass=3 did not.

In [28]:
# Distribution for Survival by Embarked
fig = go.Figure()

embarked_colors = {'S': '#9c179e', 'C': '#7201a8', 'Q': '#0d0887'}
embarked_labels = {'S': 'Southampton', 'C': 'Cherbourg', 'Q': 'Queenstown'}

for embarked in ['S', 'C', 'Q']:
    filtered_df = df[df['Embarked'] == embarked]
    
    fig.add_trace(go.Bar(
        x=filtered_df['Survived'].value_counts().index.map({0: 'Not Survived', 1: 'Survived'}),
        y=filtered_df['Survived'].value_counts(),
        text=filtered_df['Survived'].value_counts(),
        textposition='outside',
        textfont=dict(size=14,color="black"),
        hovertemplate='<b>%{x}</b><br>Count: %{y}<br>',
        marker_color=embarked_colors[embarked], marker_line=dict(width=1, color='black'),
        name=f'{embarked_labels[embarked]}'
    ))

fig.update_xaxes(type='category', tickvals=[0, 1], ticktext=['<b>Not Survived</b>', '<b>Survived</b>'], tickfont_size=14, color ='black')
fig.update_yaxes(tickfont_size=14)

fig.update_layout(
    title='<b>Distribution for Survival by Port of Embarkation</b>',title_font_family="Times New Roman",title_font=dict(size=30),title_font_color="black",
    xaxis=dict(title='<b>Survival</b>', title_font=dict(size=18)),
    yaxis=dict(title='<b>Count</b>', title_font=dict(size=18), color='black'),
    title_x=0.5,
    barmode='group',
    legend=dict(title='<b>Port of Embarkation</b>', x=0.9, y=1, title_font_family="Times New Roman", font=dict(family="Courier",
            size=13, color="black"), bgcolor="mintcream", bordercolor="Black", borderwidth=2),
)

fig.show()

From the above visualization, we can see that Cherbourg was the only port of embarkation where the majority of passengers survived; the majority of passengers who embarked at Southampton and Queenstown did not survive.

Data Preprocessing¶

In [29]:
num_list = df.select_dtypes(include='number').columns.tolist()
obj_list = df.select_dtypes(include='object').columns.tolist()
print(f'\nNumerical columns in the dataframe: {num_list}')
print(f'\nObject columns in the dataframe: {obj_list}')
Numerical columns in the dataframe: ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

Object columns in the dataframe: ['Sex', 'Embarked']
In [30]:
# Displaying the number of unique values for each numerical columns
for i in num_list:
    print("No. of unique values in %s column are: %s" % (i, df[i].nunique()))
No. of unique values in Survived column are: 2
No. of unique values in Pclass column are: 3
No. of unique values in Age column are: 88
No. of unique values in SibSp column are: 7
No. of unique values in Parch column are: 7
No. of unique values in Fare column are: 248
In [147]:
# Displaying the number of unique values for each categorical columns
for i in obj_list:
    print("No. of unique values in %s column are: %s" % (i, df[i].nunique()))
No. of unique values in Sex column are: 2
No. of unique values in Embarked column are: 3
In [31]:
# Displaying the unique values in each column
cat_col=[]
print("Unique values in each column are - ")
print()
for col in df.columns:
    if df[col].nunique()<=10:
        print(f'{col}: {df[col].unique()}')
        cat_col.append(col)
Unique values in each column are - 

Survived: [0 1]
Pclass: [3 1 2]
Sex: ['male' 'female']
SibSp: [1 0 3 4 2 5 8]
Parch: [0 1 2 5 3 4 6]
Embarked: ['S' 'C' 'Q' nan]

Preprocessing: Encoding categorical data¶

In [149]:
#first few records of data
df.head()
Out[149]:
Survived Pclass Sex Age SibSp Parch Fare Embarked
0 0 3 male 22.0 1 0 7.2500 S
1 1 1 female 38.0 1 0 71.2833 C
2 1 3 female 26.0 0 0 7.9250 S
3 1 1 female 35.0 1 0 53.1000 S
4 0 3 male 35.0 0 0 8.0500 S
In [32]:
# Selecting required columns for model training
df = df[['Survived','Pclass','Age','SibSp','Parch','Fare','Sex','Embarked']]
df.head()
Out[32]:
Survived Pclass Age SibSp Parch Fare Sex Embarked
0 0 3 22.0 1 0 7.2500 male S
1 1 1 38.0 1 0 71.2833 female C
2 1 3 26.0 0 0 7.9250 female S
3 1 1 35.0 1 0 53.1000 female S
4 0 3 35.0 0 0 8.0500 male S
In [33]:
# Encode categorical variables - Sex and Embarked
df['Sex'].value_counts()
Out[33]:
Sex
male      577
female    314
Name: count, dtype: int64
In [34]:
df['Embarked'].value_counts()
Out[34]:
Embarked
S    644
C    168
Q     77
Name: count, dtype: int64
In [35]:
# Convert categorical variables to numerical using one-hot encoding
print(df)
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)
print('\nTitanic dataset after converting all values to numerical ones: \n',df)
     Survived  Pclass   Age  SibSp  Parch     Fare     Sex Embarked
0           0       3  22.0      1      0   7.2500    male        S
1           1       1  38.0      1      0  71.2833  female        C
2           1       3  26.0      0      0   7.9250  female        S
3           1       1  35.0      1      0  53.1000  female        S
4           0       3  35.0      0      0   8.0500    male        S
..        ...     ...   ...    ...    ...      ...     ...      ...
886         0       2  27.0      0      0  13.0000    male        S
887         1       1  19.0      0      0  30.0000  female        S
888         0       3  28.0      1      2  23.4500  female        S
889         1       1  26.0      0      0  30.0000    male        C
890         0       3  32.0      0      0   7.7500    male        Q

[891 rows x 8 columns]

Titanic dataset after converting all values to numerical ones: 
      Survived  Pclass   Age  SibSp  Parch     Fare  Sex_male  Embarked_Q  \
0           0       3  22.0      1      0   7.2500      True       False   
1           1       1  38.0      1      0  71.2833     False       False   
2           1       3  26.0      0      0   7.9250     False       False   
3           1       1  35.0      1      0  53.1000     False       False   
4           0       3  35.0      0      0   8.0500      True       False   
..        ...     ...   ...    ...    ...      ...       ...         ...   
886         0       2  27.0      0      0  13.0000      True       False   
887         1       1  19.0      0      0  30.0000     False       False   
888         0       3  28.0      1      2  23.4500     False       False   
889         1       1  26.0      0      0  30.0000      True       False   
890         0       3  32.0      0      0   7.7500      True        True   

     Embarked_S  
0          True  
1         False  
2          True  
3          True  
4          True  
..          ...  
886        True  
887        True  
888        True  
889       False  
890       False  

[891 rows x 9 columns]
In [36]:
# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
print(' \nTitanic dataset after standarizing the data: \n' , df)
 
Titanic dataset after standarizing the data: 
      Survived  Pclass       Age  SibSp  Parch      Fare  Sex_male  Embarked_Q  \
0           0       3 -0.565736      1      0 -0.502445      True       False   
1           1       1  0.663861      1      0  0.786845     False       False   
2           1       3 -0.258337      0      0 -0.488854     False       False   
3           1       1  0.433312      1      0  0.420730     False       False   
4           0       3  0.433312      0      0 -0.486337      True       False   
..        ...     ...       ...    ...    ...       ...       ...         ...   
886         0       2 -0.181487      0      0 -0.386671      True       False   
887         1       1 -0.796286      0      0 -0.044381     False       False   
888         0       3 -0.104637      1      2 -0.176263     False       False   
889         1       1 -0.258337      0      0 -0.044381      True       False   
890         0       3  0.202762      0      0 -0.492378      True        True   

     Embarked_S  
0          True  
1         False  
2          True  
3          True  
4          True  
..          ...  
886        True  
887        True  
888        True  
889       False  
890       False  

[891 rows x 9 columns]
In [37]:
# Cast back to int; note this truncates the standardized Age and Fare toward zero,
# discarding most of the scaled information
df['Survived'] = df['Survived'].astype(int)
df['Age'] = df['Age'].astype(int)
df['Fare'] = df['Fare'].astype(int)
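To see why the int-cast above is lossy: standard-scaled values mostly fall in (-1, 1), and casting to int truncates them toward zero. A quick illustrative check (the values below are made up, not taken from the dataframe):

```python
import pandas as pd

# Typical standard-scaled values; casting to int truncates toward zero,
# so everything in (-1, 1) collapses to 0
scaled = pd.Series([-0.57, 0.66, -0.26, 0.43, 1.8, -1.2])
truncated = scaled.astype(int)
print(truncated.tolist())  # [0, 0, 0, 0, 1, -1]
```

Keeping the scaled floats, or binning Age into ranges, would preserve more signal.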

Preprocessing: Correlation between the variables¶

In [47]:
# Compute the correlation matrix
corr_matrix = df.corr()
corr_matrix_round = corr_matrix.round(3)
# Creating the heatmap using plotly
fig = go.Figure(data=go.Heatmap(
                z=np.array(corr_matrix_round),
                x=corr_matrix.columns,
                y=corr_matrix.index,
                colorscale = 'viridis',
                texttemplate="%{z}"
                
))

fig.update_xaxes(tickfont_size=10, color ='black')
fig.update_yaxes(tickfont_size=10, color ='black')

# Customizing the heatmap layout
fig.update_layout(
    title="<b>Correlation Heatmap</b>",title_font_family="Times New Roman",title_font=dict(size=30),title_font_color="black",
    title_x=0.2, 
   )
fig.layout.height = 800
fig.layout.width = 800

# Display the heatmap
fig.show()

A correlation score with an absolute value close to 1 indicates a strong linear relationship between two variables; a score near 0 indicates a weak one.
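A convenient way to read such a heatmap is to rank each feature's correlation with the target by absolute value. A minimal sketch on a toy frame (the column names mirror the real ones; the values are illustrative):

```python
import pandas as pd

# Toy numeric frame standing in for the encoded Titanic data
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0],
    'Pclass':   [3, 1, 2, 3, 1, 3],
    'Fare':     [7.3, 71.3, 13.0, 8.1, 53.1, 7.8],
})

# Correlation of each feature with the target, strongest first
target_corr = toy.corr()['Survived'].drop('Survived').abs().sort_values(ascending=False)
print(target_corr)
```

On the full dataframe, the equivalent one-liner would be `df.corr()['Survived'].drop('Survived').abs().sort_values(ascending=False)`.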

Note: GitHub does not render interactive graphs. View the dynamic visualization by running the code locally for full interactivity or please try loading this page with nbviewer.org.

Split the data into features (X) and target variable (y)¶

In [48]:
X = df.drop('Survived', axis=1)
y = df['Survived']
In [49]:
X.head()
Out[49]:
Pclass Age SibSp Parch Fare Sex_male Embarked_Q Embarked_S
0 3 0 1 0 0 True False True
1 1 0 1 0 0 False False False
2 3 0 0 0 0 False False True
3 1 0 1 0 0 False False True
4 3 0 0 0 0 True False True
In [50]:
y.head()
Out[50]:
0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int32
Now there are no object-dtype variables in the dataframe, so we can feed it to the models and begin training¶
In [51]:
# Calculate information gain for each feature

from sklearn.feature_selection import mutual_info_classif
info_gain = mutual_info_classif(X, y, discrete_features=[1, 2, 3, 4, 5, 6, 7])
In [52]:
# Display information gain for each feature

print("\nInformation Gain for Each Feature:")
print(dict(zip(X.columns, info_gain)))
Information Gain for Each Feature:
{'Pclass': 0.030490761089030816, 'Age': 0.009730110884370938, 'SibSp': 0.02319708627963908, 'Parch': 0.016365584523616174, 'Fare': 0.028992414667499185, 'Sex_male': 0.15087048925218183, 'Embarked_Q': 6.651420415212939e-06, 'Embarked_S': 0.011924751561370184}
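The raw dict is hard to scan; sorting it as a Series ranks the features. A sketch using the (rounded) values printed above:

```python
import pandas as pd

# Rounded information-gain values from the output above
info_gain = {'Pclass': 0.0305, 'Age': 0.0097, 'SibSp': 0.0232, 'Parch': 0.0164,
             'Fare': 0.0290, 'Sex_male': 0.1509, 'Embarked_Q': 0.0000,
             'Embarked_S': 0.0119}

# Rank features by information gain, most informative first
ranked = pd.Series(info_gain).sort_values(ascending=False)
print(ranked.index[0])  # Sex_male
```

Sex_male dominates by a wide margin, consistent with the survival-by-sex plot earlier.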
In [53]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
In [54]:
print(X.shape, X_train.shape, X_test.shape)
(891, 8) (712, 8) (179, 8)
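The split above is purely random. Passing `stratify=y` (not used in this notebook, but a common refinement) keeps the survived/not-survived ratio the same in both halves. A self-contained sketch with synthetic data (`X_demo` and `y_demo` are made up; `y_demo` mimics the roughly 38% survival rate):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced 0/1 target (38% positives)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 38 + [0] * 62)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2,
                                      random_state=42, stratify=y_demo)

# The class ratio is preserved in both halves
print(ytr.mean(), yte.mean())
```

For the real split, `train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)` would do the same.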

Model Training¶

Predicting Titanic survival is a classic binary classification problem: the goal is to determine whether a passenger survived (1) or did not survive (0) based on various features. Several machine learning models can be used for this task, including Logistic Regression, SVM, the KNN classifier, Gaussian Naive Bayes, Ridge Classifier, etc. This is done as follows:

In [55]:
# Applying all the model together

# LogisticRegression
logistic = LogisticRegression()
lr = logistic.fit(X_train, y_train)
y_pred_lr = logistic.predict(X_test)
accuracy_lr = accuracy_score(y_test, y_pred_lr)

# DecisionTree
dtree = DecisionTreeClassifier()
dt = dtree.fit(X_train, y_train)
y_pred_dt = dtree.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)

# RandomForest
rfmodel = RandomForestClassifier()
rf = rfmodel.fit(X_train, y_train)
y_pred_rf = rfmodel.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)

# BaggingClassifier
bagg = BaggingClassifier()
bg = bagg.fit(X_train, y_train)
y_pred_bg = bagg.predict(X_test)
accuracy_bg = accuracy_score(y_test, y_pred_bg)

# AdaBoostClassifier
ada = AdaBoostClassifier()
ad = ada.fit(X_train, y_train)
y_pred_ad = ada.predict(X_test)
accuracy_ad = accuracy_score(y_test, y_pred_ad)

# GradientBoostingClassifier
gdb = GradientBoostingClassifier()
gd = gdb.fit(X_train, y_train)
y_pred_gd = gdb.predict(X_test)
accuracy_gd = accuracy_score(y_test, y_pred_gd)

# XGBClassifier
xgb = XGBClassifier()
xg = xgb.fit(X_train, y_train)
y_pred_xg = xgb.predict(X_test)
accuracy_xg = accuracy_score(y_test, y_pred_xg)

# SVM
svc = SVC()
sv = svc.fit(X_train, y_train)
y_pred_sv = svc.predict(X_test)
accuracy_sv = accuracy_score(y_test, y_pred_sv)                   
                             
# KNN
knn = KNeighborsClassifier()
kn = knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
accuracy_knn = accuracy_score(y_test, y_pred_knn)

# GaussianNB
naive_gb = GaussianNB()
ngb = naive_gb.fit(X_train, y_train)
y_pred_ngb = naive_gb.predict(X_test)
accuracy_ngb = accuracy_score(y_test, y_pred_ngb) 

# BernoulliNB
naive_bn = BernoulliNB()
nbr = naive_bn.fit(X_train, y_train)
y_pred_nbr = naive_bn.predict(X_test)
accuracy_nbr = accuracy_score(y_test, y_pred_nbr)
In [56]:
evc = VotingClassifier(estimators=[('lr',lr),('dt',dt),('rf', rf),('bg', bg),('ad',ad),
                                  ('gd', gd),('xg', xg),('sv', sv),('knn', knn),
                                  ('ngb', ngb),('nbr', nbr)], voting='hard')

model_evc = evc.fit(X_train, y_train)
pred_evc = evc.predict(X_test)
accuracy_evc = accuracy_score(y_test, pred_evc)
In [57]:
list1 = ['LogisticRegression','DecisionTree','RandomForest','Bagging','Adaboost',
         'GradientBoosting', 'XGBoost','SupportVector','KNearestNeighbors',
         'NaiveBayesGaussian','NaiveBayesBernoullies','VotingClassifier']

list2 = [accuracy_lr, accuracy_dt, accuracy_rf, accuracy_bg,accuracy_ad, accuracy_gd, 
         accuracy_xg, accuracy_sv, accuracy_knn, accuracy_ngb, accuracy_nbr, accuracy_evc]

list3 = [logistic, dtree, rfmodel, bagg, ada, gdb, xgb, svc, knn, naive_gb,naive_bn, evc]

final_accuracy = pd.DataFrame({'Method Used': list1, "Accuracy": list2})
print(final_accuracy)

charts = sns.barplot(x="Method Used", y = 'Accuracy', data=final_accuracy,palette='Set1')
charts.set_xticklabels(charts.get_xticklabels(), rotation=90)
print(charts)
              Method Used  Accuracy
0      LogisticRegression  0.798883
1            DecisionTree  0.793296
2            RandomForest  0.821229
3                 Bagging  0.787709
4                Adaboost  0.804469
5        GradientBoosting  0.787709
6                 XGBoost  0.815642
7           SupportVector  0.815642
8       KNearestNeighbors  0.826816
9      NaiveBayesGaussian  0.770950
10  NaiveBayesBernoullies  0.782123
11       VotingClassifier  0.793296
Axes(0.125,0.11;0.775x0.77)
In [60]:
# Define classifiers
classifiers = [
    ('Logistic Regression', LogisticRegression(max_iter=15)),
    ('Decision Tree', DecisionTreeClassifier(criterion='entropy')),
    ('Random Forest', RandomForestClassifier(n_estimators=100, criterion='entropy')),
    ('AdaBoost', AdaBoostClassifier()),
    ('Gradient Boosting', GradientBoostingClassifier()),
    ('XGBoost', XGBClassifier()),
    ('SVM', SVC(gamma='auto')),
    ('KNN', KNeighborsClassifier()),
    ('Naive Bayes (Gaussian)', GaussianNB()),
    ('Naive Bayes (Bernoulli)', BernoulliNB())
]

# Store results
classifier_names = []
mean_accuracies = []
std_accuracies = []

print('10-fold cross-validation for all models:\n')
for name, clf in classifiers:
    scores = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
    classifier_names.append(name)
    mean_accuracies.append(scores.mean())
    std_accuracies.append(scores.std())
    print(f"\n10-fold Cross Validation scores for {name}: {scores}")
    print(f"Average Accuracy: {scores.mean():.2f} (+/- {scores.std():.2f}) [{name}]")
    print("*" * 110)

# Plot results
x_pos = np.arange(len(classifier_names))

plt.figure(figsize=(12, 6))
plt.bar(x_pos, mean_accuracies, yerr=std_accuracies, align='center', alpha=0.7, capsize=5)
plt.xticks(x_pos, classifier_names, rotation=45, ha='right')
plt.xlabel('Classifiers')
plt.ylabel('Accuracy')
plt.title('10-fold Cross Validation Results')
plt.tight_layout()
plt.show()
10-fold cross-validation for all models:


10-fold Cross Validation scores for Logistic Regression: [0.84722222 0.77777778 0.71830986 0.94366197 0.87323944 0.69014085
 0.77464789 0.71830986 0.71830986 0.92957746]
Average Accuracy: 0.80 (+/- 0.09) [Logistic Regression]
**************************************************************************************************************

10-fold Cross Validation scores for Decision Tree: [0.80555556 0.79166667 0.71830986 0.90140845 0.81690141 0.73239437
 0.81690141 0.76056338 0.78873239 0.88732394]
Average Accuracy: 0.80 (+/- 0.06) [Decision Tree]
**************************************************************************************************************

10-fold Cross Validation scores for Random Forest: [0.80555556 0.80555556 0.73239437 0.92957746 0.85915493 0.76056338
 0.8028169  0.71830986 0.69014085 0.91549296]
Average Accuracy: 0.80 (+/- 0.08) [Random Forest]
**************************************************************************************************************

10-fold Cross Validation scores for AdaBoost: [0.875      0.80555556 0.70422535 0.94366197 0.84507042 0.74647887
 0.78873239 0.78873239 0.73239437 0.94366197]
Average Accuracy: 0.82 (+/- 0.08) [AdaBoost]
**************************************************************************************************************

10-fold Cross Validation scores for Gradient Boosting: [0.84722222 0.77777778 0.73239437 0.91549296 0.88732394 0.77464789
 0.8028169  0.74647887 0.78873239 0.91549296]
Average Accuracy: 0.82 (+/- 0.06) [Gradient Boosting]
**************************************************************************************************************

10-fold Cross Validation scores for XGBoost: [0.81944444 0.80555556 0.73239437 0.90140845 0.81690141 0.76056338
 0.8028169  0.71830986 0.73239437 0.88732394]
Average Accuracy: 0.80 (+/- 0.06) [XGBoost]
**************************************************************************************************************

10-fold Cross Validation scores for SVM: [0.88888889 0.79166667 0.73239437 0.97183099 0.87323944 0.77464789
 0.81690141 0.77464789 0.73239437 0.91549296]
Average Accuracy: 0.83 (+/- 0.08) [SVM]
**************************************************************************************************************

10-fold Cross Validation scores for KNN: [0.83333333 0.76388889 0.70422535 0.91549296 0.81690141 0.8028169
 0.77464789 0.76056338 0.73239437 0.87323944]
Average Accuracy: 0.80 (+/- 0.06) [KNN]
**************************************************************************************************************

10-fold Cross Validation scores for Naive Bayes (Gaussian): [0.875      0.76388889 0.71830986 0.91549296 0.83098592 0.77464789
 0.76056338 0.73239437 0.63380282 0.88732394]
Average Accuracy: 0.79 (+/- 0.08) [Naive Bayes (Gaussian)]
**************************************************************************************************************

10-fold Cross Validation scores for Naive Bayes (Bernoulli): [0.83333333 0.73611111 0.69014085 0.95774648 0.78873239 0.70422535
 0.73239437 0.74647887 0.71830986 0.90140845]
Average Accuracy: 0.78 (+/- 0.08) [Naive Bayes (Bernoulli)]
**************************************************************************************************************
[Bar chart: mean 10-fold cross-validation accuracy per classifier, with standard-deviation error bars]
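The "pick the best model" step below can also be done programmatically from the lists built in the loop above. A minimal sketch, using illustrative stand-in values for `classifier_names` and `mean_accuracies` (in the notebook these are populated by the cross-validation loop):

```python
import numpy as np

# Stand-ins for the lists built in the cross-validation loop above
classifier_names = ["Logistic Regression", "AdaBoost", "Gradient Boosting", "SVM"]
mean_accuracies = [0.80, 0.82, 0.82, 0.83]

# Pick the classifier with the highest mean CV accuracy
best_idx = int(np.argmax(mean_accuracies))
print(f"Best model: {classifier_names[best_idx]} "
      f"(mean CV accuracy = {mean_accuracies[best_idx]:.2f})")
# Best model: SVM (mean CV accuracy = 0.83)
```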
Deciding to go with the SVM model, since it has the highest mean cross-validation accuracy (0.83); its best single fold reached 97.18%¶
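Before committing to the SVM, its hyperparameters (`C`, `kernel`, `gamma`) could be tuned with a grid search. A minimal sketch on synthetic data; the parameter grid and the use of `make_classification` as a stand-in for `X_train`/`y_train` are assumptions, not what this notebook did:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the notebook's X_train / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Illustrative candidate grid (not the notebook's settings)
param_grid = {"C": [0.1, 1, 10],
              "kernel": ["rbf", "linear"],
              "gamma": ["scale", "auto"]}

grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_demo, y_demo)
print("Best params:", grid.best_params_)
print(f"Best mean CV accuracy: {grid.best_score_:.2f}")
```

`grid.best_estimator_` is already refit on the full training data and can be used directly for prediction.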
In [61]:
# Selecting the SVM model: predict on the full feature set using the fitted SVC (`sv`)

final_result = pd.DataFrame(sv.predict(X))
final_result = final_result.rename(columns={0: "Titanic_Survived_Prediction"})
final_result
Out[61]:
Titanic_Survived_Prediction
0 0
1 1
2 1
3 1
4 0
... ...
886 0
887 1
888 1
889 0
890 0

891 rows × 1 columns

In [62]:
final_model = pd.concat([(titanic_data.drop(['Survived'], axis = 1)), titanic_data['Survived'], pd.DataFrame(final_result)], axis = 1)
final_model
Out[62]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Embarked Survived Titanic_Survived_Prediction
0 1 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 0 0
1 2 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 1 1
2 3 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 1 1
3 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 1 1
4 5 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 S 0 0
887 888 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 S 1 1
888 889 3 Johnston, Miss. Catherine Helen "Carrie" female 28.0 1 2 W./C. 6607 23.4500 S 0 1
889 890 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C 1 0
890 891 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 Q 0 0

891 rows × 12 columns

In [63]:
# Save the table with predictions under a new name (index=False avoids an extra index column)
final_model.to_csv("Titanic-Dataset-with-predictions.csv", index=False)
In [64]:
# Displaying the final accuracy score
from sklearn.metrics import accuracy_score

print("Final Accuracy Score:", accuracy_score(final_model['Survived'], final_model['Titanic_Survived_Prediction']))
Final Accuracy Score: 0.8271604938271605
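Accuracy alone can hide class-specific errors; a confusion matrix and per-class report give a fuller picture. A minimal sketch with stand-in labels (in the notebook, `final_model['Survived']` and `final_model['Titanic_Survived_Prediction']` would be used instead):

```python
from sklearn.metrics import confusion_matrix, classification_report

# Stand-in true labels and predictions (illustrative, not the notebook's data)
y_true = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 1, 1, 0]

# Rows = actual class, columns = predicted class
print(confusion_matrix(y_true, y_pred))

# Per-class precision, recall, and F1
print(classification_report(y_true, y_pred, target_names=["Died", "Survived"]))
```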

Conclusion:¶

  • Our analysis unveiled key insights into the Titanic dataset. We addressed missing values by filling null entries in the Age and Fare columns with medians (robust to the outliers present in both), while the Cabin column was dropped because most of its values were missing.
  • Notably, female passengers survived at a far higher rate than male passengers.
  • Furthermore, passenger class 3 had the highest number of deaths, while most class 1 passengers survived.
  • Passengers who embarked at Queenstown had a higher survival rate than those from Southampton.
  • In this Titanic Survival Prediction analysis, we explored various aspects of the dataset to understand the factors influencing survival.
  • We found that 342 passengers, i.e. 38.4% of the 891 in the dataset, survived the sinking, with significant differences in survival rates across passenger classes, genders, and age groups.
  • The dataset also revealed that features such as fare and embarkation point played a role in survival.
  • We trained several classification models to predict survival, most of which performed well, likely due to the relatively small dataset size. Of these, SVM achieved the highest mean 10-fold cross-validation accuracy (0.83; its best single fold reached 97.18%, and BernoulliNB's best fold 95.77%), and the final SVM predictions scored 82.7% accuracy on the full dataset.